Abstract: Many online social networks feature restrictive web
interfaces which only allow the query of a user's local neighbor- 
hood through the interface. To enable analytics over such an 
online social network through its restrictive web interface, many 
recent efforts reuse the existing Markov Chain Monte Carlo 
methods such as random walks to sample the social network 
and support analytics based on the samples. The problem with
such an approach, however, is the large number of queries often
required (i.e., a long "mixing time") for a random walk to reach
a desired (stationary) sampling distribution.

In this paper, we consider a novel problem of enabling a faster 
random walk over online social networks by "rewiring" the social 
network on-the-fly. Specifically, we develop Modified TOpology 
(MTO)-Sampler which, by using only information exposed by the 
restrictive web interface, constructs a "virtual" overlay topology 
of the social network while performing a random walk, and 
ensures that the random walk follows the modified overlay 
topology rather than the original one. We show that MTO- 
Sampler not only provably enhances the efficiency of sampling, 
but also achieves significant savings on query cost over real-world
online social networks such as Google Plus and Epinions.

I. Introduction 

A. Aggregate Estimation over Online Social Networks 

An online social network allows its users to publish content
and form connections with other users. To retrieve information
from a social network, one generally needs to issue
an individual-user query through the social network's web
interface by specifying a user of interest, and the web interface
returns the content published by the user as well as a list of
other users connected with the user¹.

An online social network not only provides a platform for 
users to share information with their acquaintances, but also
enables a third party to perform a wide variety of analytical 
applications over the social network - e.g., the analysis of 
rumor/news propagation, the mining of sentiment/opinion on 
certain subjects, and social media based market research. 
While some third parties, e.g., advertisers, may be able to 
negotiate contracts with the network owners to get access 
to the full underlying database, many third parties lack the 

1 We currently focus on the undirected relationship between users.



resources to do so. To enable these third-party analytical 
applications, one must be able to accurately estimate big- 
picture aggregates (e.g., the average age of users, the COUNT 
of user posts that contain a given word) over an online social 
network by issuing a small number of individual-user queries 
through the social network's web interface. We address this 
problem of third-party aggregate estimation in the paper. 

B. Existing Sampling Based Solutions and Their Problems 

An important challenge facing third-party aggregate esti- 
mation is the lack of cooperation from online social network 
providers. In particular, the information returned by each 
individual-user query is extremely limited - only containing 
information about the neighborhood of one user. Furthermore, 
almost all large-scale online social networks enforce limits
on the number of web requests one can issue (e.g., 600 open
graph queries per 600 seconds for Facebook² and 350 requests
per hour for Twitter³). As a result, it is practically impossible
to crawl/download most or all data from an online social 
network before generating aggregate estimations. There is also 
no available way for a third party to obtain the entire topology 
of the graph underlying the social network. 

To address this challenge, a number of sampling techniques 
have been proposed for performing analytics over an online 
social network without the prerequisite of crawling [10]-[12],
[15]. The objective of sampling is to randomly select elements
(e.g., nodes/users or edges/relationships) from the online social 
network according to a pre-determined probability distribution, 
and then to generate aggregate estimations based on the 
retrieved samples. Since only individual local neighborhoods 
(i.e., a user and the set of its neighbors) - rather than the 
entire graph topology - can be retrieved from the social 
network's web interface, to the best of our knowledge, all 
existing sampling techniques without prior knowledge of all 
nodes/edges are built upon the idea of performing random 
walks over the graph which only require knowledge of the 
local neighborhoods visited by the random walks. 

2 https://developers.facebook.com/docs/best-practices/
3 https://dev.twitter.com/docs/rate-limiting 



In the literature, there are two popular random walk schemes:
simple random walk and Metropolis-Hastings random walk.
Simple random walk (SRW) [17] starts from an arbitrary 
user, repeatedly hops from one user to another by choosing 
uniformly at random from the former user's neighborhood, 
and stops after a number of steps to retrieve the last user as 
a sample. When the simple random walk is sufficiently long, 
the probability for each user to be sampled tends to reach a 
stationary (probability) distribution proportional to each user's 
degree (i.e., the number of users connected with the user). 
Thus, based on the retrieved samples and knowledge of such 
a stationary distribution, one can generate unbiased estimations 
of AVG aggregates (with or without selection conditions) over 
all users in the social network. If the total number of users
in the social network is available⁴, then COUNT and SUM
aggregates can be answered without bias as well.
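
As a hedged illustration of how degree-proportional samples can be reweighted into an AVG estimate, the sketch below uses a standard ratio (importance-weighted) estimator. The text does not spell out its estimator, so the function name `estimate_avg` and the `(value, degree)` sample layout are assumptions made for this example only.

```python
def estimate_avg(samples):
    """Estimate the population average of an attribute from SRW samples.

    `samples` is a list of (value, degree) pairs (a hypothetical layout).
    Because SRW visits a node with probability proportional to its degree,
    each sample is weighted by 1/degree to undo that bias.
    """
    weighted_sum = sum(value / degree for value, degree in samples)
    total_weight = sum(1.0 / degree for _, degree in samples)
    return weighted_sum / total_weight


# Toy check: three users with degrees 1, 2, 3 and values 10, 20, 30.
# Each user appears `degree` times, mimicking a degree-proportional sample.
toy_samples = [(10, 1)] + [(20, 2)] * 2 + [(30, 3)] * 3
print(estimate_avg(toy_samples))  # close to the true average, 20
```

A ratio estimator of this form is only asymptotically unbiased, so it should be read as one possible realization of the reweighting idea rather than the estimator used in the experiments.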

Metropolis-Hastings random walk (MHRW) is a random
walk, constructed by the well-known MH algorithm, that can
achieve any target distribution (typically the uniform
distribution). As an extension of MHRW, [11] suggests that,
given knowledge of all the ids of a graph, one can conduct a
random jump (RJ), which jumps to a random vertex⁵ in the
graph with a fixed probability in each step of the MHRW.
Although MHRW can yield asymptotically uniform samples,
which require no additional processing for subsequent analysis,
it is slower than SRW for almost all practical measures of
convergence, such as degree distribution distance, KS distance,
and mean degree error. According to [10] and [14], SRW is 1.5-8
times faster than MHRW. We therefore use SRW as the baseline,
while also including MHRW in the experimental section.

A critical problem of existing sampling techniques, how- 
ever, is the large number of individual-user queries (i.e., web 
requests) they require for retrieving each sample. Consider the 
above-described simple random walk as an example. In order 
to reach the stationary distribution (and thereby an accurate 
aggregate estimation), one may have to issue a large number of 
queries as a "burn-in" period of the random walk. Traditional 
studies on graph theory found that the length of such a burn-in
period is determined by the graph conductance - an intrinsic
property of the graph topology (formally defined in Section II).
In particular, the smaller the conductance is, the longer the 
burn-in period will be (i.e., the more individual-user queries 
will be required by sampling). 

Unfortunately, a recent study [18] on real-world social
networks such as Facebook and LiveJournal found the
conductance of their graphs to be substantially lower than expected.
As a result, a random walk on these social networks often
requires a large number of individual-user queries - e.g.,
a single random walk of approximately 500 to 1500 steps on
LiveJournal, a real-world social network of one million nodes,
to achieve an acceptable variance distance [18]. One can see
that, in order to retrieve enough samples to reach an accurate 

4 Which is the case for many real-world social networks whose providers 
publish the total number of users for advertising purposes. 

5 It may need the global topology or the whole user id space to generate
a random vertex, and is thus not viable for all online social networks.



aggregate estimation, the existing sampling techniques may 
require a very large number of individual-user queries. 

C. Outline of Technical Results 

In this paper, we consider a novel problem of how to sig- 
nificantly increase the conductance of a social network graph 
by modifying the graph topology on-the-fly (during the third- 
party random walk process). In the following, we shall first 
explain what we mean by on-the-fly topology modification, 
and then describe the rationale behind our main ideas for 
topology modification. 

First, by topology modification we do not actually modify 
the original topology of the social network graph - indeed, 
no third party other than the social network provider has the 
ability to do so. What we modify is the topology of an overlay 
graph on which we perform the random walks. Fig. 1 depicts
an example: if we can decide that not considering a particular
edge in the random walk process can make the burn-in period
shorter (i.e., increase the conductance), then we are essentially
performing random walks over an overlay graph on which this
edge is removed. By doing so, we can achieve equally accurate
aggregate estimation with lower query cost. One can see that,
with traditional random walk techniques, the overlay graph is 
exactly the same as the original social network graph. Our 
objective here is to manipulate edges in the overlay graph so 
as to maximize the graph conductance. 

It is important to note that the technical challenge here is 
not how edge manipulations can boost graph conductance - a 
simple method to reach theoretical maximum on conductance 
is to repeatedly insert edges to the graph until it becomes 
a complete graph. This requires the knowledge of all nodes
in the social network, which a third party does not have. The
key challenge here is how to perform edge manipulations based
only on the knowledge of local neighborhoods that a random
walk has passed by, and yet increase the conductance of
the entire graph in a significant manner. In the following, we
provide an intuitive explanation of our main ideas for topology 
modification. 

To understand the main ideas, we first introduce the concepts
of cross-cutting and non-cross-cutting edges intuitively
with an example in Fig. 1 (we shall formally define these
concepts in Section II). Generally speaking, if we consider
a social network graph consisting of multiple densely connected
components (e.g., $S$ and $\bar{S}$ in Fig. 1), then the edges
connecting them are likely to be cross-cutting edges, while
edges inside each densely connected component are likely
non-cross-cutting ones. A key intuition here is that the more
cross-cutting edges and/or the fewer non-cross-cutting edges a
graph has, the higher its conductance is. For example, Graph $G$
in Fig. 1 has a low conductance (i.e., a long burn-in period) as a
random walk is likely to get "stuck" in one of the two dense
components, which are difficult to escape given that there is
only one cross-cutting edge $(u, v)$. On the other hand, with
far fewer non-cross-cutting edges and a few additional
cross-cutting edges, $G^*$ has a much higher conductance as it is much


Fig. 1. A concept of a random walk on the topologically modified overlay
graph. Panels: overlay graph $G^*$ and original graph $G$.



easier now for a random walk to move from one component 
to the other. 

With the concepts of cross-cutting and non-cross-cutting
edges, we develop Modified TOpology Sampler (MTO-Sampler),
a topology manipulation technique which first determines⁶
whether a given edge in the graph is a cross-cutting
edge based solely upon knowledge of the local neighborhood
topology, and then removes the edge if it is non-cross-cutting.
MTO-Sampler may also "move" an edge by changing a node
connected to the edge if it is determined that, by doing so,
the new edge is more likely to be a cross-cutting edge. We
shall show in the paper that MTO-Sampler is capable of
significantly improving the efficiency of random walks: for
the example in Fig. 1, MTO-Sampler is capable of reducing the
mixing time (i.e., query cost of a random walk) by 97%. We
also demonstrate through experimental results the significant
improvement of efficiency achieved by MTO-Sampler for
real-world social networks such as Epinions and Google Plus.

The main contributions of our approach include: 

• (Problem Novelty) We consider a novel problem of modifying
the graph topology on-the-fly (during the random
walk process) for the efficient third-party sampling of
online social networks.

• (Solution Novelty) We develop MTO-Sampler which 
determines whether an edge is (non-)cross-cutting based 
solely upon local neighborhood knowledge retrieved by 
the random walk, and then manipulates the graph topol- 
ogy to significantly improve sampling efficiency. 

• Our contributions also include extensive theoretical analysis
(on various social network models) and experimental
evaluation on synthetic and real-world social networks,
as well as online at Google+, which demonstrate the
superiority of our MTO-Sampler over the traditional
sampling techniques.

II. Preliminaries 

A. Model of Online Social Networks 

In this paper, we consider an online social network with an 
interface that allows input queries of the form 

q(v): SELECT * FROM D WHERE USER-ID = v, 

and responds with the information about user $v$ (e.g., user
name, self-description, user-published content) as well as
the list of all other users connected with $v$ (e.g., $v$'s friends
in the network). This is a model followed by many online
social networks - e.g., Google Plus and Facebook - with the
interface provided as either an end-user-friendly web page or
a developer-specific API call.

Consider the social-network topology as an undirected
graph $G(V, E)$, where each node in $V$ corresponds to
a user in the social network⁷ and each edge in $E$ represents
the connection between two users. One can see that the answer
to query $q(v)$ ($v \in V$) is a set of nodes $N(v) \subseteq V$ such that
$\forall u \in N(v)$, there is an edge $e : (u, v) \in E$. We henceforth
refer to $N(v)$ as the neighborhood of $v$. We use $k_v$ to denote
the degree of $v$ - i.e., $k_v = |N(v)|$. For abbreviation, we also
write $e : (u, v)$ as $e_{uv}$.

Running Example: We shall use, throughout this paper,
the 22-node, 111-edge barbell graph shown (as the
original graph $G$) in Fig. 1 as a running example.



6 Note that, as we shall prove in Section III-A, it is impossible to assert
deterministically that an edge is cross-cutting. Nonetheless, it is possible to 
assert deterministically that an edge is non-cross-cutting. Thus, our algorithm 
has two possible outputs: non-cross-cutting or uncertain. We shall show in 
the paper that it outputs non-cross-cutting for a large number of (non-cross- 
cutting) edges in real-world social networks. 



B. Performance Measures for Sampling 

In the following, we shall discuss two key objectives for 
sampling: (1) minimizing bias - such that the retrieved samples 
can be used to accurately estimate aggregate query answers, 
and (2) reducing the number of queries required for sampling 
- given the stringent requirement often put in place by real- 
world social networks on the number of queries one can issue 
per day. 

Bias: In general, sampling bias is the "distance" between 
the target (i.e., ideal) distribution of samples and the actual 
sampling distribution - i.e., the probability for each tuple to be 
retrieved as a sample. We shall further discuss a concrete bias
measure in the next subsection and an experimental measure
in Section IV-A.3.

Query Cost: We measure query cost as the number of unique
queries one has to issue for the sampling process, as any
duplicate query can be answered from a local cache without
consuming the query limit enforced by the social network
provider.

C. Random Walk 

A random walk is a Markov Chain Monte Carlo (MCMC)
method which takes successive random steps on the
above-described graph $G$ according to a transition matrix
$P = (p_{uv})$, $u, v \in V$, where $p_{uv}$ represents the probability for
the random walk to transit from node $u$ to $v$. The premise
here is that, after performing a random walk for a sufficient
number of steps, the probability distribution for the walk to
land on each node in $G$ converges to a stationary distribution
$\pi$, which then becomes the sampling distribution⁸. There are
many different types of random walks, corresponding to the
different designs of $P$ and different stationary distributions.
In this paper, we consider the simple random walk that has a
stationary distribution of $\pi(v) = k_v/(2|E|)$ for all $v \in V$.

7 Note that without introducing ambiguity, we use "node" and "social
network user" interchangeably in this paper.

Definition 1: (Simple Random Walk). Given a current
node $v$, a simple random walk chooses uniformly at random
a neighboring node $u \in N(v)$ and transits to $u$ in the next step
- i.e.,

$$p_{vu} = \begin{cases} 1/k_v & \text{if } u \in N(v), \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
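
The walk of Definition 1, together with the unique-query notion of cost from Section II-B, can be sketched as follows. This is a minimal illustration under stated assumptions: `CachedInterface`, `simple_random_walk`, and the in-memory toy graph are hypothetical names standing in for a real web interface.

```python
import random


class CachedInterface:
    """Stand-in for a restrictive web interface answering q(v).

    The in-memory `adjacency` dict is an assumption for illustration; a real
    sampler would issue a web request on a cache miss instead.
    """

    def __init__(self, adjacency):
        self._adjacency = adjacency
        self._cache = {}
        self.unique_queries = 0  # query cost: only cache misses count

    def query(self, v):
        if v not in self._cache:
            self.unique_queries += 1
            self._cache[v] = list(self._adjacency[v])
        return self._cache[v]


def simple_random_walk(interface, start, steps, rng):
    """Take `steps` uniform hops from `start` and return the final node."""
    current = start
    for _ in range(steps):
        neighbors = interface.query(current)  # one q(v) per step
        current = rng.choice(neighbors)       # p_vu = 1/k_v
    return current


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
interface = CachedInterface(toy_graph)
sample = simple_random_walk(interface, start=0, steps=50, rng=random.Random(7))
print(sample, interface.unique_queries)
```

Revisited nodes are served from the cache, so the reported cost never exceeds the number of distinct nodes the walk has touched.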
One can see that each step of a simple random walk requires 
exactly one query (i.e., q(v) to identify the neighborhood 
of v and select the next stop u). Thus, the performance of 
sampling - i.e., the tradeoff between bias and query cost - 
is determined by how fast the random walk converges to the 
stationary distribution. Formally, we measure the convergence 
speed as the mixing time defined as follows. 

Definition 2: (Mixing Time) Given $G : (V, E)$, after $t$
steps of simple random walk, the relative point-wise distance
between the current sampling distribution and the stationary
distribution is

$$\Delta(t) = \max_{u, v \in V,\; v \in N(u)} \frac{|P^t_{uv} - \pi(v)|}{\pi(v)} \tag{2}$$

where $P^t_{uv}$ is the element of $P^t$ with indices $u$ and $v$. The
mixing time of the random walk is the minimum value of $t$
such that $\Delta(t) < \epsilon$, where $\epsilon$ is a pre-determined threshold on
relative point-wise distance.
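
For a small graph whose full topology is known, Definition 2 can be evaluated directly by powering the transition matrix. The sketch below is illustrative only: it assumes the denominator in (2) is $\pi(v)$ (the usual convention for relative point-wise distance), restricts the maximum to pairs with $v \in N(u)$ as stated, and uses hypothetical helper names.

```python
def transition_matrix(adjacency):
    """Return (sorted node list, row-stochastic SRW matrix P)."""
    nodes = sorted(adjacency)
    index = {v: i for i, v in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for u in nodes:
        for v in adjacency[u]:
            matrix[index[u]][index[v]] = 1.0 / len(adjacency[u])
    return nodes, matrix


def multiply(a, b):
    """Plain dense matrix product, adequate for toy graphs."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def relative_pointwise_distance(p_t, p, pi):
    """Delta(t): max over u and v in N(u) of |P^t_uv - pi(v)| / pi(v)."""
    size = len(p)
    return max(abs(p_t[u][v] - pi[v]) / pi[v]
               for u in range(size) for v in range(size) if p[u][v] > 0)


def mixing_time(adjacency, epsilon, max_steps=10_000):
    """Smallest t with Delta(t) < epsilon, or None if not reached."""
    nodes, p = transition_matrix(adjacency)
    two_e = sum(len(adjacency[v]) for v in nodes)
    pi = [len(adjacency[v]) / two_e for v in nodes]  # pi(v) = k_v / (2|E|)
    p_t = p
    for t in range(1, max_steps + 1):
        if relative_pointwise_distance(p_t, p, pi) < epsilon:
            return t
        p_t = multiply(p_t, p)
    return None


toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(mixing_time(toy_graph, epsilon=0.01))
```

A third-party sampler cannot run this computation on a real network, since it requires the whole topology; it is useful only for checking intuition on synthetic graphs.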

One can see from the definition that the relative point-wise
distance $\Delta(t)$ measures the bias of the random walk after $t$
steps. Mixing time, on the other hand, captures the query cost
required to reduce the bias below a pre-determined threshold $\epsilon$.
In the following subsection, we describe a key characteristic
of the graph that determines the mixing time - the conductance
of the graph.

D. Conductance: An Efficiency Indicator 

Intuitively, the conductance, which indicates how fast the
simple random walk converges to its stationary distribution,
measures how "well-knit" a graph is. Specifically, the
conductance is determined by a cut of the graph $G$ - i.e., a partition
of $V$ into two disjoint subsets $S$ and $\bar{S}$ - which minimizes the
ratio between the probability for the random walk to move
from one partition to the other and the probability for the
random walk to stay in the same partition. Formally, we have
the following definition.

8 That is, if we take the end node as a sample 



Definition 3: (Conductance). The conductance⁹ of a graph
$G : (V, E)$ is

$$\Phi(G) = \min_{S \subset V} \frac{|\{e_{uv} \mid u \in S, v \in \bar{S}\}|}{\min\{|\{e_{uv} \mid u \in S, v \in V\}|,\; |\{e_{uv} \mid u \in \bar{S}, v \in V\}|\}}$$

The relationship between the graph conductance and the
mixing time of a simple random walk is illustrated by the
following inequality [3]:

$$(1 - 2\Phi(G))^t \le \Delta(t) \le \frac{2|E|}{\min_{v \in V} k_v} \left(1 - \frac{\Phi(G)^2}{2}\right)^t \tag{3}$$

One can see that the graph conductance $\Phi(G)$ ranges between
0 and 1 - and the larger $\Phi(G)$ is, the smaller the mixing time
will be (for a fixed threshold $\epsilon$). Also note from (3) the
log-scale relationship between $\Phi(G)$ and the mixing time. This
indicates that a small change in $\Phi(G)$ may lead to a significant
change in the mixing time. Let

$$\frac{2|E|}{\min_{v \in V} k_v} \left(1 - \frac{\Phi(G)^2}{2}\right)^t \le \epsilon \tag{4}$$

$$\Rightarrow\; t \ge \frac{1}{\log(1 - \Phi(G)^2/2)} \log\frac{\epsilon}{c} \tag{5}$$

$$\Rightarrow\; t \ge -\frac{1}{\log(1 - \Phi(G)^2/2)} \log(c/\epsilon) \tag{6}$$

Here $c = 2|E| / \min_{v \in V} k_v$. For example, increasing conductance
from 0.010 to 0.012 will change the mixing time from
$46050.5 \cdot \log(c/\epsilon)$ to $31979.1 \cdot \log(c/\epsilon)$.
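
The coefficients in this example can be checked numerically. The sketch below evaluates the factor $-1/\log(1 - \Phi(G)^2/2)$ from (6); taking the logarithm to be base 10 is an assumption, made because that choice reproduces the quoted figures to within rounding. The function name is hypothetical.

```python
import math


def mixing_time_coefficient(phi):
    """Return -1 / log10(1 - phi^2 / 2), the factor multiplying log(c/eps)."""
    return -1.0 / math.log10(1.0 - phi ** 2 / 2.0)


for phi in (0.010, 0.012):
    print(phi, round(mixing_time_coefficient(phi), 1))
```

Because the factor grows roughly like $1/\Phi(G)^2$ for small conductance, a 20% increase in $\Phi(G)$ lowers the coefficient by about 30%.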



Running Example: The conductance of the barbell
graph in the running example is $\Phi(G) = 1/(\binom{11}{2} + 1) =
0.018$. The corresponding (and unique) $S$ and $\bar{S}$ are
shown in Fig. 1. Correspondingly, the mixing time to
reach a relative point-wise distance of $\Delta(t) < \epsilon$ is
bounded from above by $14212.3 \cdot \log(22.2/\epsilon)$. We shall
show throughout the paper how our on-the-fly topology
modification techniques can significantly increase
conductance and reduce the mixing time for this running
example.
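
The numbers in this running example can be reproduced with a short script. The sketch below builds the barbell graph as two 11-node cliques joined by one edge and evaluates the ratio of Definition 3 for the cut that separates the two cliques. It reads each denominator term as the number of edges with at least one endpoint in the given part, which is the reading that yields $1/(\binom{11}{2} + 1)$. All function names are hypothetical, and the brute-force search is feasible only for very small graphs.

```python
from itertools import combinations


def barbell_edges(clique_size):
    """Two cliques of `clique_size` nodes joined by a single bridge edge."""
    left = range(clique_size)
    right = range(clique_size, 2 * clique_size)
    edges = set(combinations(left, 2)) | set(combinations(right, 2))
    edges.add((clique_size - 1, clique_size))  # the single bridge edge
    return edges


def cut_ratio(edges, subset):
    """Crossing edges divided by the smaller count of edges touching a side."""
    s = set(subset)
    crossing = sum(1 for u, v in edges if (u in s) != (v in s))
    touching_s = sum(1 for u, v in edges if u in s or v in s)
    touching_rest = sum(1 for u, v in edges if u not in s or v not in s)
    return crossing / min(touching_s, touching_rest)


def conductance_bruteforce(edges, nodes):
    """Minimize cut_ratio over all non-empty proper subsets (tiny graphs only)."""
    nodes = list(nodes)
    return min(cut_ratio(edges, subset)
               for r in range(1, len(nodes))
               for subset in combinations(nodes, r))


edges = barbell_edges(11)
print(len(edges))                             # 111
print(round(cut_ratio(edges, range(11)), 3))  # 0.018
print(conductance_bruteforce(barbell_edges(4), range(8)))
```

The exhaustive search is run on a smaller 8-node barbell, where the minimizing cut is again the one through the bridge.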



E. Key for Conductance: Cross-Cutting Edges 

A key observation from Definition 3 is that the graph
conductance critically depends on the number of edges which
"cross-cut" $S$ and $\bar{S}$ - i.e., $|\{e_{uv} \mid u \in S, v \in \bar{S}\}|$. The
more such cross-cutting edges there are, the higher the graph 
conductance is likely to be. On the other hand, since a non- 
cross-cutting edge is only counted in the denominator, the 
more non-cross-cutting edges there are in the graph, the lower 
the conductance is likely to be. Formally, we define cross- 
cutting edges as follows. 

Definition 4: (Cross-cutting edges). For a given graph
$G(V, E)$, an edge $e_{uv}$ is a cross-cutting edge if and only if
there exists $S \subset V$ such that $u \in S$, $v \in \bar{S}$ where $\bar{S} = V \setminus S$,
and

$$\phi(S) = \frac{|\{e_{uv} \mid u \in S, v \in \bar{S}\}|}{\min\{|\{e_{uv} \mid u \in S, v \in V\}|,\; |\{e_{uv} \mid u \in \bar{S}, v \in V\}|\}}$$

takes the minimum value among all possible $S \subset V$.
We note that in large graphs such as online social networks, it
is reasonable to assume that the number of cross-cutting edges
is relatively small when compared to the total number of edges in
$S$ or $\bar{S}$.

9 Rigorously, the conductance is determined by both the graph topology and
the transition matrix of the random walk. Here we tailor the definition to the
simple random walk considered in this paper.

One can see that our objective of on-the-fly topology
modification is then to increase the number of cross-cutting 
edges and decrease the number of non-cross-cutting edges as 
much as possible. We describe our main ideas for doing so in 
the next section. 



Running Example: For the barbell graph, adding any
edge between the two halves of the graph produces a
new cross-cutting edge, and increases the graph
conductance from $\Phi(G) = 0.018$ to 0.035 - i.e., the bound on the
mixing time will be reduced to 3758.1/14212.3 = 0.264 of its
original value - a significant reduction of about 74%.



III. Main Ideas of On-The-Fly Topology Modification

A. Technical Challenges: Negative Results 



One can see from Section II-E that the key for increasing 
the conductance of a social network (and thereby reducing 
the query cost of sampling) through topology modification 
is to determine whether an edge is a cross-cutting edge or 
not. Unfortunately, the deterministic identification of a cross- 
cutting edge is a hard problem (in the worst case) even if the 
entire graph topology is given as prior knowledge, as shown 
in the following theorem. 

Theorem 1: The problem of determining whether an edge 
is cross-cutting or not is NP-hard. 

Proof: Consider the case of equal transition probability
for each edge. The problem of finding all cross-cutting edges is
equivalent to finding the optimum cut of the graph according
to the Cheeger constant - a problem proved to be NP-hard [7].

■ 

Given the worst-case hardness result, we now consider 
the best-case scenario - i.e., is there any graph topology 
(which is not the worst-case input, of course) for which it 
is possible to efficiently identify cross-cutting edges? It is 
easy to see that, if the entire graph topology is given, then
there certainly exist such graphs - with the original graph in
Fig. 1 being an example - for which the cross-cutting edge(s)
can be straightforwardly identified. Nonetheless, our interest
lies in making such identifications based solely upon local
neighborhood knowledge - because of the aforementioned
restrictions of online social-network interfaces. The following
theorem, unfortunately, shows that it is impossible for one to 
deterministically confirm the cross-cutting nature of an edge 
unless the entire graph topology has been crawled. 




Fig. 2. By cloning graph $G$, we can always construct graph $G'$ such that
simply adding an edge $e : (v_i, v_j)$ may decrease the conductance.



Theorem 2: Given the local neighborhood topology of
vertices accessed by a third-party sampler, $\{v_1, \ldots, v_k\} \subseteq V$
in $G(V, E)$ where $k < |V|$, for any given edge $e : (v_i, v_j)$,
there must exist a graph $G'(V', E')$ such that: (1) $e : (v_i, v_j)$
is not a cross-cutting edge for $G'$, and (2) $G$ and $G'$ are
indistinguishable from the view of the sampler - i.e., there
exists $\{v'_1, \ldots, v'_k\} \subseteq V'$ which have exactly the same local
neighborhoods as $\{v_1, \ldots, v_k\}$.

Proof: The construction of $G'$ can be stated as follows:
First, insert $n$ extra vertices $v_1^0, \ldots, v_n^0$ and $e$ extra edges into
the graph, such that $\forall e : (v_i, v_j) \in E$, there is $e^0 : (v_i^0, v_j^0)$
in the new graph. Note that at this moment, there is no edge
between any $v_i$ and $v_j^0$. Then, in the second step, identify from
$G$ a vertex $w$ which has not been accessed by the sampler -
i.e., $w \notin \{v_1, \ldots, v_k\}$ - and insert into the graph an edge
$e : (w, w^0)$. One can see that the only cross-cutting edge in
the output graph $G'$ is $(w, w^0)$ - i.e., $e : (v_i, v_j)$ cannot be a
cross-cutting edge for $G'$. An intuitive demonstration of the
proof is shown in Fig. 2. ■
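
The cloning construction used in this proof can be sketched in a few lines. The code below is an illustration under stated assumptions: the tagging of original nodes as `(v, 0)` and clones as `(v, 1)`, the function name, and the toy graph are all hypothetical.

```python
def clone_with_bridge(adjacency, w):
    """Build G' as two disjoint copies of G joined by one edge (w, w^0).

    Nodes of the original copy are tagged (v, 0) and clones (v, 1); this
    tagging scheme is an assumption made for illustration.
    """
    g_prime = {}
    for v, neighbors in adjacency.items():
        g_prime[(v, 0)] = {(x, 0) for x in neighbors}
        g_prime[(v, 1)] = {(x, 1) for x in neighbors}
    # Join the two copies through a vertex the sampler has not accessed.
    g_prime[(w, 0)].add((w, 1))
    g_prime[(w, 1)].add((w, 0))
    return g_prime


toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
g_prime = clone_with_bridge(toy_graph, w=3)  # suppose node 3 was never queried
print(sorted(g_prime[(3, 0)]))
```

Every vertex other than `w` keeps exactly the neighborhood it had in the original graph, which is why a sampler that never queried `w` cannot tell the two graphs apart.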

It is important to note from the theorem, however, that 
it still leaves two possible ways for one to increase the 
conductance of a social network based on only the local 
neighborhood knowledge: (1) While the theorem indicates that 
it is impossible to deterministically confirm the cross-cutting 
nature of an edge, it may still be possible to deterministically 
disprove an edge from being cross-cutting - i.e., we may prove 
that an edge is definitely non-cross-cutting based on just local 
neighborhood knowledge, and therefore remove it to increase 
the conductance deterministically. (2) It is still possible to 
conditionally or probabilistically evaluate the likelihood of 
an edge being cross-cutting - e.g., we may determine that 
an edge absent from the original graph is more likely to be 
a cross-cutting edge (if added) than an existing edge, and 
thereby replace the existing edge with the new one to increase 
the conductance in a probabilistic fashion. We consider the 
removal and replacement strategies, respectively, in the next 
two subsections. 

B. Deterministic Identification of Non-cross-cutting Edges 

To illustrate the main idea of our deterministic identification
of non-cross-cutting edges (for removals), we start with an
example in Fig. 3 to show why we can determine, based solely
upon the local neighborhoods of $u$ and $v$ as shown in the graph,
that $e : (u, v)$ (henceforth denoted by $e_{uv}$) in the figure is not a



Fig. 3. An illustration that the edge $e_{uv}$ cannot be a cross-cutting edge
(Theorem 3). Locally, (a) and (c) have 6 cross-cutting edges, while (b) and
(d) only have 5 of them.



cross-cutting edge. The intuition behind this is fairly simple:
When $u$ and $v$ share a large number of common neighbors
(e.g., 5 in Fig. 3) but have relatively few other edges (e.g., 1
each in Fig. 3), it is highly unlikely for the partition to cut
through $e_{uv}$ rather than the other edges of $u$ and $v$ - e.g.,
$(u, u^0)$ in Fig. 3 - if it cuts through any edges associated with
$u$ and $v$ at all.

The rigorous (dis)proof proceeds by contradiction.
Suppose $e_{uv}$ is a cross-cutting edge between two partitions of
the graph, $S$ and $\bar{S}$. One can see that since $u$ and $v$ belong to
different partitions, there must be at least 6 cross-cutting edges
in the subgraph (Fig. 3 (a) depicts an example). We now show
in the following discussion that this is actually impossible
because one can always construct another partition $S'$ and
$\bar{S}'$ (by "dragging" $u$ and $v$ into the same part) and reduce
the number of cross-cutting edges to at most 5. Note that this
contradicts the definition of $S$ and $\bar{S}$ being a configuration
which minimizes the number of cross-cutting edges. Thus, $e_{uv}$
cannot be a cross-cutting edge.

To understand how the construction of $S'$ and $\bar{S}'$ works,
consider Fig. 3 (b) as an example. For the partition illustrated
in Fig. 3 (a), we can "drag" $u$ into the partition that contains $v$
to form the new configuration, such that the number of
cross-cutting edges associated with $u$ and $v$ is now at most 5, as
shown in Fig. 3 (b). Note that the other edges not shown in the
subgraph (whether cross-cutting or not) are not affected by the
re-configuration, because all vertices associated with $u$ are
already known in the local neighborhood of $u$ (shown in Fig. 3).

More generally, for the other possible settings of $S$ and $\bar{S}$
(such as Fig. 3 (c)), one can construct the re-configuration
by analogy, following this general principle: First, find the
"more popular" partition (i.e., either $S$ or $\bar{S}$) among the 5
common neighbors of $u$ and $v$ (e.g., $S$ in Fig. 3 (a) or Fig. 3 (c)).
Then, drag one of $u$ and $v$ to ensure that both of them
are in this more popular partition under the new configuration.
One can see that, since at most 2 common neighbors of $u$
and $v$ are in the less popular partition, the number of
cross-cutting edges under the new configuration is at most $2 \cdot 2 + 1$,
where $2 \cdot 2$ is the number of cross-cutting edges associated
with the 2 common neighbors in the less popular partition (at
most 2 for each), and 1 is the number of cross-cutting edges
associated with the other (non-common) neighbor of the node
being dragged (i.e., $u^0$ in Fig. 3 (a)).
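
The dragging argument can be checked exhaustively on the local structure described above: $u$ and $v$ joined by an edge, 5 common neighbors, and one further neighbor each. The sketch below enumerates every way of placing the 7 surrounding nodes while $u$ and $v$ sit on opposite sides, and tests whether moving $u$ or $v$ across always leaves strictly fewer cut edges among those incident to $u$ or $v$. The node labels `w1`..`w5`, `u0`, `v0` and the function names are assumptions made for illustration.

```python
from itertools import product

COMMON = ["w1", "w2", "w3", "w4", "w5"]
# Edges incident to u or v only; other edges are unaffected by the drag.
LOCAL_EDGES = ([("u", "v"), ("u", "u0"), ("v", "v0")]
               + [("u", w) for w in COMMON]
               + [("v", w) for w in COMMON])


def cut_size(side):
    """Number of local edges whose endpoints lie on different sides."""
    return sum(1 for a, b in LOCAL_EDGES if side[a] != side[b])


def dragging_always_helps():
    """True if, for every placement, some drag strictly reduces the cut."""
    others = COMMON + ["u0", "v0"]
    for bits in product((0, 1), repeat=len(others)):
        side = dict(zip(others, bits), u=0, v=1)  # u and v start separated
        after = min(cut_size(dict(side, u=1)), cut_size(dict(side, v=0)))
        if after >= cut_size(side):
            return False
    return True


print(dragging_always_helps())  # True
```

The check covers only the edges touching $u$ or $v$, which is sufficient because a drag changes the cut status of no other edge.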

The following theorem depicts the general case for which
we can remove an edge on-the-fly to increase the graph
conductance. Recall that $N(u)$ and $k_u$ represent the set of
neighbors and the degree of a node $u$, respectively.

Theorem 3: [Edge Removal Criteria]: Given $G(V, E)$,
$\forall u, v \in V$, if $e_{uv} \in E$ and

$$\left\lceil \frac{|N(u) \cap N(v)|}{2} \right\rceil + 1 > \frac{1}{2} \max\{k_u, k_v\}, \tag{7}$$

then $e_{uv}$ is not a cross-cutting edge.

Proof: Let $n = |N(u) \cap N(v)|$. Without loss of generality,
assume $u \in S$, $v \in \bar{S}$; then there must be $n$ cross-cutting
edges in the $n$ disjoint paths of length 2 between $u$ and
$v$. We denote by $n_u$ and $n_v$ the numbers of cross-cutting edges in
these $n$ paths connected with $u$ and $v$, respectively, so
$n_u + n_v = n$. One can see that if we try to "drag" $u$ from
$u \in S$ to $u \in \bar{S}$, all the edges connected with $u$ would be
flipped, which means the old cross-cutting edges
become the new non-cross-cutting edges, and vice versa. By
the assumption in inequality (7), $\lceil n/2 \rceil + 1 > \frac{1}{2} \max\{k_u, k_v\}$,
so either $n_u + 1 > \frac{1}{2} k_u$ or $n_v + 1 > \frac{1}{2} k_v$ holds. Without
loss of generality, assume the inequality holds for vertex $u$;
we change $u$ from set $S$ to $\bar{S}$, so the number of cross-cutting
edges must strictly decrease. Since we have assumed that
the number of edges in $S$ or $\bar{S}$ is much greater than the number
of cross-cutting edges, $\Phi(G)$ must decrease according to
the decrease of the number of cross-cutting edges, which
contradicts the assumption that $e_{uv}$ is a cross-cutting edge. ■
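
Since the criterion of Theorem 3 depends only on $N(u)$ and $N(v)$, it can be evaluated from two individual-user queries. The sketch below is a minimal illustration that assumes the left-hand side of (7) is $\lceil |N(u) \cap N(v)|/2 \rceil + 1$, as the proof's use of $n_u + n_v = n$ suggests. The function name and the toy graph modeled on the description of Fig. 3 are hypothetical.

```python
def is_provably_non_cross_cutting(adjacency, u, v):
    """Return True if edge (u, v) satisfies the removal criterion (7)."""
    if v not in adjacency[u]:
        raise ValueError("(u, v) must be an edge of the graph")
    common = len(set(adjacency[u]) & set(adjacency[v]))
    k_max = max(len(adjacency[u]), len(adjacency[v]))
    # ceil(common / 2) + 1 > k_max / 2, written with integers only.
    return 2 * ((common + 1) // 2 + 1) > k_max


# Local structure described for Fig. 3: 5 common neighbors, 1 extra edge each.
common_neighbors = ["w1", "w2", "w3", "w4", "w5"]
fig3_like = {
    "u": ["v", "u0"] + common_neighbors,
    "v": ["u", "v0"] + common_neighbors,
    "u0": ["u"],
    "v0": ["v"],
}
for w in common_neighbors:
    fig3_like[w] = ["u", "v"]

print(is_provably_non_cross_cutting(fig3_like, "u", "v"))   # True
print(is_provably_non_cross_cutting(fig3_like, "u", "u0"))  # False
```

A `False` result does not mean the edge is cross-cutting; it only means the criterion cannot rule it out from local information.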

Due to space limitations, please refer to the technical report
[23] for the proofs of all theorems in the rest of the paper.
Intuitively, Theorem 3 tells us that if two nodes have
enough common neighbors, then we can deterministically say
that the edge between them is non-cross-cutting. Moreover, (7)
is tight - i.e., if it does not hold, then we can always construct
a counterexample where $e_{uv}$ is cross-cutting - as shown in
the following corollary.

Corollary 1: For all $N(u)$, $N(v)$, $k_u$, $k_v$ which satisfy

$$\left\lceil \frac{|N(u) \cap N(v)|}{2} \right\rceil + 1 \le \frac{1}{2} \max\{k_u, k_v\}, \tag{8}$$

there always exists a graph $G(V, E)$ in which $e_{uv}$ is
cross-cutting.

Fig. 4. Replace the edge $e_{uv}$ with $e_{uw}$. Panels: (a), (b).

Fig. 5. A demonstration that $e_{uv}$ cannot be a cross-cutting edge. Panels: (a), (b).

Running Example: With our on-the-fly edge removals, any random walk is essentially following an overlay topology $G^*$ which can be constructed by applying Theorem 3 to every edge in the original graph $G$. For the bar-bell running example, the solid lines in Fig. 1 depict $G^*$. The conductance is now $\Phi(G^*) = 0.053$. Compared with the original conductance of 0.018, the corresponding lower bound on mixing time is reduced to 1638.3/14212.3 = 0.115 of the original value - a reduction of 89%.






C. Conditional Identification of Cross-cutting Edges 

We now describe our second idea of conditionally identifying cross-cutting edges. We start with an example in Fig. 4 to show why we can replace an existing edge with a new one such that (1) the new edge is more likely to be cross-cutting, and (2) the replacement is guaranteed to not decrease the conductance.

Specifically, consider the replacement of $e_{uv}$ by $e_{uw}$ given the neighborhoods of $u$ and $v$. A key observation here is that $e_{uv}$ and $e_{vw}$ cannot both be cross-cutting edges. The reason is that otherwise we could always "drag" $v$ into the same partition as $u$ and $w$ to reduce the number of cross-cutting edges by at least 1. Given this key observation, one can see that the replacement of $e_{uv}$ by $e_{uw}$ will only have two possible outcomes:

• if $e_{uv}$ is a cross-cutting edge, then $e_{uw}$ must also be a cross-cutting edge because, due to the observation, $e_{vw}$ cannot be a cross-cutting edge. Thus, the replacement leads to no change in the graph conductance.

• if $e_{uv}$ is not a cross-cutting edge, then replacing it with $e_{uw}$ will either keep the same conductance, or increase the conductance if $e_{uw}$ is cross-cutting.

As such, the replacement operation never reduces the conductance, and might increase it when $e_{uw}$ is cross-cutting. More generally, we have the following theorem.

Theorem 4: Given $G(V, E)$, $\forall v \in V$, if $k_v = 3$ and $u, w \in N(v)$, then replacing edge $e_{uv}$ with $e_{uw}$ will not decrease the conductance, while it also has a positive probability of increasing the conductance.
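The replacement operation itself is a small local rewrite of the adjacency structure. The sketch below assumes the graph is held as a dictionary of neighbor sets; the function name `replace_edge` and the toy graph are hypothetical and only illustrate the operation, they are not the paper's implementation.

```python
def replace_edge(adj, u, v, w):
    """Replace the undirected edge (u, v) with (u, w), in place.

    Mirrors the precondition of the replacement theorem: v must have
    degree 3, and u and w must be distinct neighbors of v.
    """
    if len(adj[v]) != 3:
        raise ValueError("replacement is only applied when k_v = 3")
    if u == w or u not in adj[v] or w not in adj[v]:
        raise ValueError("u and w must be distinct neighbors of v")
    # Remove e_uv in both directions.
    adj[u].discard(v)
    adj[v].discard(u)
    # Add e_uw in both directions (a no-op if the edge already exists).
    adj[u].add(w)
    adj[w].add(u)
    return adj


# Toy graph: v has exactly three neighbors u, w and x.
toy = {
    "u": {"v", "a"},
    "v": {"u", "w", "x"},
    "w": {"v"},
    "x": {"v"},
    "a": {"u"},
}
replace_edge(toy, "u", "v", "w")
print(sorted(toy["u"]), sorted(toy["v"]), sorted(toy["w"]))
```

Note that when $e_{uw}$ is already present the operation degenerates into a plain removal of $e_{uv}$; how the sampler should treat that case is not specified here, so the sketch simply leaves the existing edge in place.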

Next, we show that $k_v = 3$ is actually the only case in which the replacement is guaranteed to not reduce the conductance, as stated by the following corollary.

Corollary 2: For $v \in V$, if $k_v \neq 3$, then there always exists a graph $G(V, E)$, $\forall u, w \in N(v)$, such that replacing $e_{uv}$ with $e_{uw}$ will decrease the conductance or have no effect.


Running Example: With Theorem 4, an example of the replacement operations one can perform over the bar-bell running example in Fig. 1 is to replace $e_{ur}$ with $e_{rv}$, given that $u$ (after edge removals) has a degree of 3. Compared with the original conductance of $\Phi(G) = 0.018$ and the post-removal conductance of $\Phi(G^*) = 0.053$, the conductance is now further increased to $\Phi(G^{**}) = 0.105$. The corresponding lower bound on mixing time is reduced to 416.6/1638.3 = 0.25 of the post-removal bound - a further reduction of 75% - and 416.6/14212.3 = 0.029 of the original bound - an overall reduction of 97%.
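The ratios quoted in the two running examples can be re-derived with a few lines of arithmetic. The sketch below only uses the numbers stated in the examples; the comparison against squared conductance ratios rests on the assumption that the lower bound scales as $1/\Phi^2$, which is a reading of the numbers rather than something stated in this section.

```python
# Lower bounds on mixing time quoted in the two running examples.
bound_original = 14212.3   # original bar-bell graph G
bound_removed = 1638.3     # after edge removals, G*
bound_replaced = 416.6     # after removals and replacements, G**

removal_ratio = bound_removed / bound_original
replacement_ratio = bound_replaced / bound_removed
overall_ratio = bound_replaced / bound_original

# Conductances quoted in the same examples.
phi_g, phi_removed, phi_replaced = 0.018, 0.053, 0.105

# Assumption of this sketch: the bound scales as 1 / conductance^2.
predicted_removal = (phi_g / phi_removed) ** 2
predicted_replacement = (phi_removed / phi_replaced) ** 2

print(f"{removal_ratio:.3f} {replacement_ratio:.3f} {overall_ratio:.3f}")
print(f"{predicted_removal:.3f} {predicted_replacement:.3f}")
```

Under that assumption, the predicted ratios agree with the quoted ones to within the rounding of the three conductance values.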



D. Extension 

If we know more about the user's neighbors, especially the common neighbors of the user and the random walk's next candidate, we can deterministically identify more non-cross-cutting edges. When the random walk reaches nodes we have accessed before, we can use their degree information without issuing extra web requests, since we can retrieve the data from our local database.

Fig. 5(a) shows an example in which, with the extra degree knowledge of $w$, a common neighbor of $u$ and $v$, $e_{uv}$ must be a non-cross-cutting edge. As $k_w = 3$, if we assume $e_{uv}$ is a cross-cutting edge, then there must be 3 cross-cutting edges between $u$ and $v$. However, there exists another configuration, Fig. 5(b), which has only 2 cross-cutting edges. This contradicts the assumption that $e_{uv}$ is a cross-cutting edge. Note that if we did not know the degree of $w$, we could not deterministically identify $e_{uv}$, since Theorem 3 does not apply here.

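Theorem 5: Given $G(V, E)$, $\forall u, v \in V$, if $e_{uv} \in E$ and

\[ \left\lceil \frac{|N(u) \cap N(v)| - |N^*|}{2} \right\rceil + 1 + \frac{1}{2}\sum_{w \in N^*} (4 - k_w) > \frac{1}{2}\max\{k_u, k_v\}, \tag{9} \]

we can assert that $e_{uv}$ is not a cross-cutting edge. Here we denote $N^* = \{w \in N(u) \cap N(v) \mid k_w \text{ is known}, 2 \le k_w \le 3\}$.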
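The extended test can be sketched in the same way as the basic one. The code below assumes that inequality (9) reads $\lceil (|N(u) \cap N(v)| - |N^*|)/2 \rceil + 1 + \frac{1}{2}\sum_{w \in N^*}(4 - k_w) > \frac{1}{2}\max\{k_u, k_v\}$, which is a reconstruction from the appendix derivation; the function name and the `known_degree` cache (degrees of previously visited nodes) are hypothetical.

```python
import math


def is_removable_extended(neighbors_u, neighbors_v, known_degree):
    """Extended removal test using cached degrees of common neighbors.

    known_degree maps already-visited nodes to their degree; only common
    neighbors with a known degree of 2 or 3 enter the set N*.
    """
    common = neighbors_u & neighbors_v
    n_star = [w for w in common
              if w in known_degree and 2 <= known_degree[w] <= 3]
    k_max = max(len(neighbors_u), len(neighbors_v))
    lhs = (math.ceil((len(common) - len(n_star)) / 2) + 1
           + 0.5 * sum(4 - known_degree[w] for w in n_star))
    return lhs > k_max / 2


# u and v both have degree 4 and share the neighbors w1 and w2.
n_u = {"v", "w1", "w2", "a"}
n_v = {"u", "w1", "w2", "b"}

without_cache = is_removable_extended(n_u, n_v, {})
with_cache = is_removable_extended(n_u, n_v, {"w1": 2, "w2": 2})
print(without_cache, with_cache)
```

With an empty cache the test reduces to the basic criterion and cannot decide this edge; once the degrees of both common neighbors are known to be 2, the same edge is identified as non-cross-cutting without any extra query.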

Intuitively, an edge between two nodes which have many common neighbors has a higher probability of being a non-cross-cutting edge. Such edges are also easy to find in online social networks: if a friend knows almost all the other friends of a person, then the edge between them may be considered a non-cross-cutting edge according to Theorems 3 and 5.



IV. Algorithm MTO-Sampler 
A. Algorithm implementation 

Algorithm description. To explain how the on-the-fly modification works, we demonstrate an example in Fig. 6. Fig. 6(a) is an overlay graph $G^*$ that has been modified according to the preceding theorems, in which edges A, C and D are removed, and edge B is replaced. Fig. 6(b) shows one possible track of how our MTO-Sampler changes the simple random walk. For instance, when the random walk sees a node $u$ with $k_u = 3$ (which satisfies the condition of replacement), it may replace an edge as described in Theorem 4. The colored area contains all the nodes that the random walk visits.

Algorithm 1 depicts the detailed procedure of MTO-Sampler. The stopping rule (which indicates that the random walk should stop and output samples) can be any convergence monitor used for Markov chains.

Algorithm 1 MTO-Sampler for Simple Random Walk
for i = 1 → samplesize do
    Starting from vertex u
    while !(Stopping rule) do
        while |N(u)| > 1 do
            Uniformly pick a neighbor v, and issue a query
            if e_uv is removable then
                N(u) ← N(u) − {v}
                continue
            else if k_v == 3 then
                /* One of u's edges can be replaced */
                if choose to replace e_uv then
                    v ← v'
                else
                    N(u) ← N(u) ∪ {v'}
                    choose u ← v or u ← v' randomly
                    break
                end if
            end if
            if rand(0, 1) < 1/2 then
                u ← v
                break
            else
                continue
            end if
        end while
    end while
    Record sample x_i ← u
end for
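To make the control flow of Algorithm 1 concrete, here is a deliberately simplified Python sketch. It covers only the edge-removal branch (no replacement, no lazy step, no stopping rule), evaluates the removal test on the current overlay, and uses a local copy of the graph in place of web queries. All names (`mto_walk`, `barbell`, the seed handling) are hypothetical, and the removal test assumes criterion (7) in the form $\lceil |N(u) \cap N(v)|/2 \rceil + 1 > \frac{1}{2}\max\{k_u, k_v\}$.

```python
import math
import random


def is_removable(nbrs_u, nbrs_v):
    """Removal criterion, evaluated on the current overlay neighbor sets."""
    common = len(nbrs_u & nbrs_v)
    k_max = max(len(nbrs_u), len(nbrs_v))
    return math.ceil(common / 2) + 1 > k_max / 2


def mto_walk(adj, start, steps, seed=0):
    """Removal-only walk; returns the final node and the overlay built so far."""
    rng = random.Random(seed)
    # A local copy stands in for the neighbor lists returned by web queries.
    overlay = {node: set(nbrs) for node, nbrs in adj.items()}
    u = start
    for _ in range(steps):
        while True:
            v = rng.choice(sorted(overlay[u]))
            if len(overlay[u]) > 1 and is_removable(overlay[u], overlay[v]):
                # Drop the edge from the overlay and pick another neighbor.
                overlay[u].discard(v)
                overlay[v].discard(u)
                continue
            break
        u = v
    return u, overlay


def barbell(m):
    """Two m-cliques joined by a single bridge edge (m - 1, m)."""
    adj = {i: set() for i in range(2 * m)}
    for block in (range(m), range(m, 2 * m)):
        for a in block:
            adj[a] |= set(block) - {a}
    adj[m - 1].add(m)
    adj[m].add(m - 1)
    return adj


graph = barbell(5)
end_node, overlay = mto_walk(graph, start=0, steps=200, seed=1)
print(end_node, sum(len(n) for n in overlay.values()) // 2)
```

On this toy bar-bell graph the walk thins out the dense clique edges as it encounters them, while the bridge between the two cliques is never removed because its endpoints share no neighbor.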



Aggregate estimation and probability revision. After collecting samples, we use Importance Sampling to directly estimate the aggregate information through the samples from the random walk's stationary distribution $\tau$.

Importance Sampling:
for i = 1 to Sample_Size N do
    x_i ← sampling from τ
    record f(x_i) /* Aggregate Function f(·) */
end for
Output estimation A(f(X))

The key challenge for MTO-Sampler using importance sampling is to estimate the stationary distribution $\tau^*$ of the MTO-Sampler random walk. Since MTO-Sampler modifies the topology, $\tau^*$ may not be equal to the stationary distribution $\tau$ of the original random walk. Here we have

\[ \tau^*(u) = \frac{k^*_u}{2|E^*|}, \tag{10} \]

where $k^*_u$ is the degree of $u$ in the overlay graph $G^*$ and $|E^*|$ is the number of edges of $G^*$. $k^*_u$ is unknown in the overlay graph $G^*$, but we can draw a simple random sample from $u$'s neighbors in $G^*$ to get an unbiased estimate of $k^*_u$.
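One common way to turn degree-biased samples into an estimate of a population average is the self-normalized importance sampling (ratio) estimator, which weights each sample by the inverse of its degree. The sketch below uses that standard estimator as an assumption; the paper does not spell out its exact estimator here, and the names `weighted_average`, `f` and `degree` are hypothetical.

```python
def weighted_average(samples, f, degree):
    """Estimate the uniform average of f from degree-proportional samples.

    Each sample x is weighted by 1 / degree(x), which cancels the bias of a
    stationary distribution proportional to the (overlay) degree.
    """
    weights = [1.0 / degree(x) for x in samples]
    total = sum(w * f(x) for w, x in zip(weights, samples))
    return total / sum(weights)


# Toy star graph: node 0 has degree 3, nodes 1..3 have degree 1.
star_degree = {0: 3, 1: 1, 2: 1, 3: 1}

# An idealized sample in which each node appears in proportion to its degree.
samples = [0, 0, 0, 1, 2, 3]

estimate = weighted_average(samples, f=star_degree.get, degree=star_degree.get)
print(estimate)
```

In the sampler, `degree` would return the estimated overlay degree $k^*_u$ rather than the degree in the original graph, since the walk follows $G^*$.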

B. Theoretical Model Analysis 

In order to theoretically analyze the performance of MTO-Sampler, we introduce a well-known graph generation model: the latent space model.

Latent space model. The latent space graph model [21] connects two nodes with a probability related to their distance in the latent space:

\[ P(i \sim j \mid d_{ij}) = \frac{1}{1 + e^{\alpha(d_{ij} - r)}}, \tag{11} \]

where $d_{ij}$ is the distance between two nodes $i$ and $j$; $r$ controls the level of sociability of a node in this graph, and $\alpha$ is the sharpness of the function.
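A small generator makes the model tangible. The sketch below assumes the link probability $1/(1 + e^{\alpha(d_{ij} - r)})$ and nodes placed uniformly at random in a rectangle, as in the synthetic experiments; the function names, the seed handling and the overflow guard are hypothetical choices of this illustration.

```python
import math
import random


def link_probability(d, r, alpha):
    """Probability of an edge between two nodes at latent distance d."""
    x = alpha * (d - r)
    if x > 700:  # avoid overflow in exp; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def latent_space_graph(n, r, alpha, width, height, seed=0):
    """Place n nodes uniformly in [0, width] x [0, height] and link them."""
    rng = random.Random(seed)
    points = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if rng.random() < link_probability(d, r, alpha):
                adj[i].add(j)
                adj[j].add(i)
    return points, adj


# alpha = +infinity turns the model into a hard distance threshold at r.
points, adj = latent_space_graph(60, r=0.7, alpha=float("inf"),
                                 width=4.0, height=5.0, seed=3)
print(sum(len(nbrs) for nbrs in adj.values()) // 2)
```

With a finite $\alpha$ the threshold becomes soft: pairs slightly farther apart than $r$ are still linked occasionally, and pairs slightly closer are sometimes left unlinked.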

We will show in the following theorem that if the distance between two nodes is smaller than a threshold $d_0$, then the edge between them is likely to be a non-cross-cutting edge. Therefore, after finding the expected number of edges that can be removed, we can calculate the increment of the conductance.

Theorem 6: Given a latent space graph model $G(V, E)$, assume $\alpha = +\infty$; then the expected number of edges we can remove satisfies



2.[./,'] >_\E\-V[d< V(r) [ 1 - ( * 



(12) 



here $V(r)$ is the volume of a hypersphere with radius $r$ in the $D$-dimensional latent space. The proof can be found in [23].

A simple simulation with 20000 points yields the empirical distribution of the point-wise distance. More specifically, if we let $r = 0.7$, $a = 4$, $b = 5$ and $D = 2$ (i.e., the nodes are placed in the rectangle $[0,a] \times [0,b]$), then

\[ E[\Phi(G^*)] \ge 1.052\,\Phi(G). \tag{13} \]

We compare the experimental results with this theoretical bound of the latent space model in Section V-B.



Fig. 6. A demo shows how the MTO-Sampler modifies the topology of the graph on-the-fly. (a) Modified overlay graph $G^*$. (b) Carry out the random walk by modifying the topology on-the-fly; it is identical to the random walk in overlay graph $G^*$.



Dataset            #nodes    #edges    90% diameter
Epinions [19]      26588     100120    4.8
Slashdot A [16]    70068     428714    4.5
Slashdot B [16]    70999     436453    4.5

TABLE I
Local Datasets



V. Experiments 
A. Experimental Setup 

1) Hardware and Platform: We conducted all experiments on a computer with an Intel Core i3 2.27GHz CPU, 4GB RAM and a 64-bit Ubuntu operating system. All algorithms were implemented in Python 2.7. Our local, synthetic and online datasets are stored in the in-memory Redis database and the MongoDB database.

2) Datasets: We tested three types of datasets in the 
experiments: local real-world social networks, Google Plus 
online social network, and synthetic social networks - which 
we describe respectively as follows. 

Local Datasets: The local social networks are real-world social networks for which the entire topology is downloaded and stored locally on our server. For these datasets, we simulated the individual-user-query-only web interface strictly according to the definition in Section 1, and ran our algorithms over the simulated interface. The rationale behind using such local datasets is that we have the ground truth (e.g., real aggregate query answers over the entire network) to compare against for evaluating the performance of our algorithms.

Table I shows the list of local social networks we tested with (collected from [1]). All three datasets are previously-captured topological snapshots of Epinions and Slashdot, two real-world online social networks. Since we focus on sampling undirected graphs in this paper, for a real-world directed graph (e.g., Epinions), we first convert it to an undirected one by only keeping edges that appear in both directions in the original directed graph. Note that by following this conversion strategy, we guarantee that a random walk over the undirected graph can also be performed over the original directed graph, with an additional step of verifying the inverse edge (resp. $v \to u$) before committing to an edge (resp. $u \to v$) in the random walk. The number of edges and the 90% effective diameter reported in Table I represent values after conversion.
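The conversion rule (keep an edge only if it appears in both directions) can be sketched as follows. The function name `to_undirected` and the edge-list input format are assumptions made for this illustration.

```python
def to_undirected(directed_edges):
    """Keep only reciprocated edges and return undirected neighbor sets."""
    edge_set = set(directed_edges)
    adj = {}
    for u, v in edge_set:
        # Skip self-loops and edges whose inverse is missing.
        if u != v and (v, u) in edge_set:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


directed = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
undirected = to_undirected(directed)
print(undirected)
```

Nodes that lose all their edges in the conversion (node 2 keeps only its link to node 1 here, and the one-way edge 2 → 3 is dropped) simply have smaller neighbor sets or disappear from the resulting graph.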

Google Plus Online Social Graph: We also tested a second type of dataset: remote, online social networks for which we have no access to the ground truth. In particular, we chose the Google Plus network because its API is the most generous among those we tested in terms of the number of accesses allowed per IP address per day. Using random walk and MTO-Sampler random walk, we have accessed 240,276 users in Google Plus. We observed that the interface provided by the Google Social Graph API strictly adheres to our model of an individual-user-query-only web interface, in that each API request returns the local neighborhood of one user. We also collected the data of users' self-descriptions.

Synthetic Social Networks: One can see that, for the real- 
world social network described above, we cannot change 
graph parameters such as size, connectivity, etc, and observe 
the corresponding performance change of our algorithms. To 
do so, we also tested synthetic social networks which were 
generated according to theoretical models. In particular, we 
tested the latent space model. 

We note that, since the effectiveness of these theoretical models is still under research/debate, we tested these synthetic social networks for the sole purpose of observing the potential change of performance for social networks with different characteristics. The superiority of our algorithm over simple random walk, on the other hand, is tested by our experiments on the two types of real-world social networks.

3) Algorithms Implementation and Evaluation: Algorithms: We tested four algorithms, the simple random walk (i.e., baseline), Metropolis-Hastings Random Walk (MHRW), Random Jump (RJ) and our MTO-Sampler, and compared their performance over all of the above-described datasets.

Input Parameters: Both simple random walk and our MTO- 
sampler are parameter-less algorithms with one exception: 
They both need a convergence indicator to determine when 

' https://plus.google.com/

" The source code of its Python wrapper can be found at https://github.com/pct/python-googleplusapi. After April 20, 2012, this social graph API will be fully retired.



the random walk has reached (or become sufficiently close to) 
the stationary distribution - so a sample can be retrieved from 
it. In the experiments, we used the Geweke indicator [9], one 
of the most popularly used methods in the literature, which 
we briefly explain as follows. 

Given a sequence of nodes retrieved by a random walk, the Geweke method determines whether the random walk reaches the stationary distribution after a burn-in of $k$ steps by first constructing two "windows" of nodes: Window A is formed by the first 10% of the nodes retrieved by the random walk after the $k$-step burn-in period, and Window B is formed by the last 50%. One can see that, if the random walk indeed converges to the stationary distribution after burn-in, then the two windows should be statistically indistinguishable. This is exactly how the Geweke indicator tests convergence. In particular, consider any attribute $\theta$ that can be retrieved for each node in the network (a commonly used one is degree, which applies to every graph). Let

\[ Z = \frac{|\bar{\theta}_A - \bar{\theta}_B|}{\sqrt{S^{\theta}_A + S^{\theta}_B}}, \tag{14} \]

where $\bar{\theta}_A$ and $\bar{\theta}_B$ are the means of $\theta$ for all nodes in Windows A and B, respectively, and $S^{\theta}_A$ and $S^{\theta}_B$ are their corresponding variances. One can see that $Z \to 0$ when the random walk converges to the stationary distribution. Thus, the Geweke indicator confirms convergence if $Z$ falls below a threshold. In the experiments, we set the threshold to be $Z < 0.1$ by default, while also performing tests with the threshold ranging from 0.01 to 1.
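A minimal version of this convergence check can be written with the standard library. The sketch assumes the statistic $Z = |\bar{\theta}_A - \bar{\theta}_B| / \sqrt{S_A + S_B}$ with plain window variances (the original Geweke diagnostic uses spectral estimates of the variance of the window means); the function name `geweke_z` and the handling of zero variance are choices of this illustration.

```python
import math
from statistics import mean, pvariance


def geweke_z(values, burn_in=0):
    """Geweke-style Z score of an attribute sequence after burn-in.

    Window A is the first 10% of the retained values, Window B the last 50%.
    """
    chain = values[burn_in:]
    n = len(chain)
    window_a = chain[: max(1, n // 10)]
    window_b = chain[n - max(1, n // 2):]
    gap = abs(mean(window_a) - mean(window_b))
    spread = math.sqrt(pvariance(window_a) + pvariance(window_b))
    if spread == 0:
        return 0.0 if gap == 0 else math.inf
    return gap / spread


# A sequence that already fluctuates around a fixed mean ...
stable = [1, 3] * 50
# ... versus a sequence that is still drifting.
drifting = list(range(100))

print(geweke_z(stable), geweke_z(drifting) < 0.1)
```

In the sampler, `values` would be the degrees (or another attribute) of the nodes visited so far, and the walk would stop once the score falls below the chosen threshold.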

Performance Measures for Sampling: As mentioned in Section II-B, a sampling technique for online social networks should be measured by query cost and bias - i.e., the distance between the (ideal) stationary distribution (i.e., $p(v) = \deg(v) / \sum_{v' \in V} \deg(v')$ for a simple random walk) and the actual probability distribution for each node to be sampled. To measure the query cost, we simply used the number of unique queries issued by the sampler. Bias, on the other hand, is more difficult to measure, as shown in the following discussion.

For a small graph, we measured bias by running the sampler for an extremely long amount of time (long enough so that each node is sampled multiple times). We then estimated the sampling distribution by counting the number of times each node is retrieved, and compared this distribution with the ideal one to derive the bias. In particular, we measured bias as the symmetrized KL-divergence between the two distributions, specifically $D_{KL}(P \| P_{sam}) + D_{KL}(P_{sam} \| P)$, where $P$ and $P_{sam}$ are the ideal distribution and the (measured) sampling distribution, respectively.
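The bias measure for small graphs can be sketched as follows, assuming the ideal distribution of a simple random walk is proportional to degree and that every node has been sampled at least once (otherwise the divergence is undefined). The helper names are hypothetical.

```python
import math
from collections import Counter


def ideal_distribution(adj):
    """Degree-proportional stationary distribution of a simple random walk."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total for v, nbrs in adj.items()}


def empirical_distribution(samples):
    """Relative frequency of each node among the collected samples."""
    counts = Counter(samples)
    return {v: c / len(samples) for v, c in counts.items()}


def kl(p, q):
    """KL divergence D(p || q); q must be positive wherever p is."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)


def symmetric_kl(p, q):
    """Sum of the two directed divergences, as used for the bias measure."""
    return kl(p, q) + kl(q, p)


path = {0: {1}, 1: {0, 2}, 2: {1}}
p_ideal = ideal_distribution(path)

unbiased = symmetric_kl(p_ideal, empirical_distribution([0, 1, 1, 2]))
biased = symmetric_kl(p_ideal, empirical_distribution([0, 1, 2, 2]))
print(unbiased, biased > 0)
```

For MTO-Sampler, the ideal distribution would be computed on the overlay graph rather than on the original one, since that is the topology the walk follows.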

For a larger graph, one may need a prohibitively large 
number of queries to sample each node multiple times. To 
measure bias in this case, we use the collected samples to 
estimate aggregate query answers over all nodes in the graph, 
and then compare the estimation with the ground truth. One 
can see that, a sampler with a smaller bias tends to produce an 



estimation with lower relative error. Specifically, for the local 
social networks, we used the average degree as the aggregate 
query (as only topological information is available for these 
networks). For the Google Social Graph experiment, we tested 
various aggregate queries including the average degree and the 
average length of user self-description. 

Finally, to verify the theoretical results derived in the paper, we also tested a theoretical measure: the mixing time of the graph. In particular, we continuously ran our MTO-Sampler until it hit each node at least once, so that we could actually obtain the topology of the overlay graph (e.g., as in Fig. 1). Then, we computed the mixing time of the overlay graph from the Second-Largest Eigenvalue Modulus (SLEM) of its adjacency matrix (see [6] and footnote 12). We would like to caution that, while we used it to verify our theoretical result that MTO-Sampler never decreases the conductance of a graph, this theoretically computed measure does not replace the above-described bias vs. query cost tests, because it is often sensitive to a small number of "badly-connected" nodes (which may not cause significant bias for practical purposes).

B. Performance Comparison Between Simple Random Walk 
and MTO-Sampler 

We started by comparing the performance of Simple Ran- 
dom Walk (SRW) and MTO-Sampler over real-world social 
networks using all three performance measures described 
above - KL-divergence, relative error vs. query cost, and 
theoretical mixing time. 

Local Datasets: We started by testing the relative error 
vs. query cost tradeoff of SRW, MTO, MHRW and RJ for 
estimating aggregate query answers. Since only topological 
information is available for local datasets, we used the average 
degree as the aggregate query. Fig [7] depicts the performance 
comparison for the three real-world social networks. Here each 
point represents the average of 20 runs of each algorithm, and 
the query cost (i.e., y-axis) represents the maximum query 
cost for a random walk to generate an estimation with relative 
error above a given value (i.e., x-axis). For random jump in 
the experiments, we set the probability of jumping as 0.5. One 
can see that, for all three datasets, our MTO-Sampler achieves 
a significant reduction of query cost compared with the SRW 
sampler, MHRW sampler and Random Jump sampler. 

We also tested the KL-divergence measured by performing 
an extremely long execution of SRW and MTO in Fig [8] - 
with each producing 20000 samples - to estimate the sampling 
probability for each node. The Geweke threshold was set 
to be 0.1 for the test. One can see that our MTO-Sampler 
not only requires fewer queries for generating each sample 
(i.e., converges to the stationary distribution faster), but also 
produces less bias than the SRW sampler. 

To further test the bias of samples generated by our MTO- 
Sampler, we also conducted the test while varying the Geweke 
threshold from 0.1 to 0.8 on the dataset Slashdot B. Fig [9] 

12 Typical theoretical mixing time of Simple Random Walk can be defined as $O(1/\log(1/\mu))$, where $\mu$ is the SLEM of the transition matrix $P$.

Fig. 7. Bias vs. Query Cost tests for local datasets' average degree.

Fig. 8. Comparison between SRW and MTO on query cost and the Kullback-Leibler divergence measure defined in Section IV-A.3 over all three datasets.

Fig. 9. Varying Geweke Threshold to get different KL divergence on dataset Slashdot B. KL and QC stand for KL divergence and Query Cost, respectively.

Fig. 10. Comparison of theoretical mixing time on latent space graph model. MTO_Both: Remove and replace edges. MTO_RM: Only remove edges. MTO_RP: Only replace edges.



depicts the change of measured bias for SRW and MTO, 
respectively. One can see from the figure that our MTO- 
Sampler achieves smaller bias than SRW for all cases being 
tested. In addition, a smaller threshold leads to a smaller bias 
and larger query cost, as indicated by the definition of Geweke 
convergence monitor. 

Google Plus online social network: For Google Plus, we do not have the ground truth as the entire social network is too large (about 85.2 million users in Feb 2012) to be crawled. Thus, we performed the tests in two steps. First, we continuously ran each sampler until its Geweke convergence monitor indicated that it had reached its stationary distribution. We then used the final estimation as the presumptive ground truth, which we refer to as the converged value. In the second step, we used the converged value to compute the relative error vs. query cost tradeoff as previously described.



Fig. 11(a) shows the estimated average degree when running SRW and MTO-Sampler random walk on Google Plus. It clearly shows that the MTO-Sampler estimate has a smaller variance and converges faster than that of the simple random walk. Fig. 11(b) and 11(c) illustrate the comparison between SRW and MTO in terms of relative error vs. query cost for multiple attributes. We note that the self-description length is the number of characters in a user's self-description. One can see that our MTO-Sampler significantly outperforms SRW.

Synthetic Social Networks: Finally, we conducted further analysis of our MTO-Sampler, in particular the individual effects of edge removals (RM) and edge replacements (RP), using the synthetic latent space model described in Section V-A.2. Fig. 10 depicts the results when the number of nodes in the graph varies from 50 to 100 (with the latent space model, we distributed these nodes in an area of $[0,4] \times [0,5]$, and set $r = 0.7$). We derived the theoretical mixing time from the second largest eigenvalue modulus of the transition matrix.



''Estimated by Paul Allen's model, http://goo.gl/nZCzN 



Note that Fig. 10 also includes the theoretical bound derived in Section 4.2. One can see from the figure that our final MTO-Sampler achieves better efficiency than the individual applications of edge removal and replacement. In addition, the theoretical model represents a conservative estimate that is outperformed by the real efficiency of MTO-Sampler - consistent with our results in Section 4.2.

VI. Related Work 

Sampling from online social networks. Several papers [2], [13], [15] have considered sampling from general large graphs, while [10], [12], [18] focus on sampling from online social networks.




Fig. 11. Google Plus online social network 



With global topology, [15] discussed sampling techniques such as random node, random edge and random subgraph sampling in large graphs. [11] introduced Albatross sampling, which combines random jump and MHRW. [10] also demonstrated a true uniform sampling method among the users' IDs as "ground-truth".

Without global topology, [10], [15] compared sampling techniques such as Simple Random Walk, Metropolis-Hastings Random Walk and traditional Breadth First Search (BFS) and Depth First Search (DFS). Also, [4], [10] considered many parallel random walks at the same time; MTO-Sampler can be applied to each parallel random walk straightforwardly, since it is a parameter-free and online algorithm.

Moreover, to the best of our knowledge, random walk is still the most practical way to sample from large graphs without global topology.

Shortening the mixing time of random walks. [18] found that the mixing time of typical online social networks is much larger than anticipated, which validates our motivation to shorten the mixing time of random walks. [5] derived the fastest mixing random walk on a graph by convex optimization on the second largest eigenvalue of the transition matrix, but it needs the whole topology of the graph, and its high time complexity makes it inapplicable to large graphs.

Theoretical models of online social network. [22] compared 
latent space model with real social network data. [8] intro- 
duced hybrid graph model to incorporate the small world phe- 
nomenon. [20] also measured the difference between multiple 
synthetic graphs and real world social network graphs. 

VII. Conclusions 
In this paper we have initiated a study of enabling faster random walks over an online social network (with a restrictive web interface) by "rewiring" the social network on-the-fly. We showed that the key for speeding up a random walk is to increase the conductance of the graph topology followed by the random walk. As such, we developed MTO-Sampler, which provably increases the graph conductance by constructing an overlay topology on-the-fly through edge removals and replacements. We provided theoretical analysis and extensive experimental studies over real-world social networks to illustrate the superiority of MTO-Sampler in achieving a smaller sampling bias while consuming a lower query cost.
VIII. Appendix



Corollary 1. For all $N(u)$, $N(v)$, $k_u$, $k_v$ which satisfy

\[ \left\lceil \frac{|N(u) \cap N(v)|}{2} \right\rceil + 1 \le \frac{1}{2}\max\{k_u, k_v\}, \tag{15} \]

there always exists a graph $G(V,E)$ in which $e_{uv}$ is cross-cutting.



and then we divide these nodes into two sets $S$ and $\bar{S}$.

Suppose $n$ is even. In order to achieve the minimum in the definition of conductance, there must exist a case in which we only need to decide whether node $u$ is in $S$ or in $\bar{S}$:

\[ \#\,\text{Cross-Cutting Edges} = \begin{cases} n + 1, & \text{if } u \in S \\ n + |O_u|, & \text{if } u \in \bar{S} \end{cases} \tag{18} \]

If $|O_u| > 1$, we can easily assert that $e_{uv}$ is a cross-cutting edge. If $|O_u| = 1$, we can let $|a(S)| = |a(\bar{S})|$ when $u \in S$ to minimize $\min\{|a(S)|, |a(\bar{S})|\}$. So $e_{uv}$ is a cross-cutting edge under this circumstance.

Also, suppose $n$ is odd. Similarly, we have

\[ \#\,\text{Cross-Cutting Edges} = \begin{cases} n + 1, & \text{if } u \in S \\ 2\lfloor n/2 \rfloor + |O_u|, & \text{if } u \in \bar{S} \end{cases} \tag{19} \]

Since $|O_u| \ge 2$, in the same way we know that $e_{uv}$ is a cross-cutting edge.



Theorem 4. Given $G(V,E)$, $\forall v \in V$, if $k_v = 3$ and $u, w \in N(v)$, then replacing edge $e_{uv}$ with $e_{uw}$ will not decrease the conductance, while it also has a positive probability of increasing the conductance.

Proof: First, no matter whether $e_{uv}$ is a cross-cutting edge or not, replacing it with $e_{uw}$ should at least obtain the same conductance. If $e_{uv}$ is not a cross-cutting edge, then obviously we are not going to decrease the conductance, because $a(S)$ or $a(\bar{S})$ will not change. If $e_{uv}$ is a cross-cutting edge, we only need to prove that $e_{uw}$ is also a cross-cutting edge. Assume $e_{uw}$ is not a cross-cutting edge; then we can infer that $e_{vw}$ is a cross-cutting edge. But $v$ only has a degree of 3, so placing $u$, $v$ and $w$ on the same side achieves a lower conductance, which contradicts the definition of conductance.

But if $e_{vw}$ is a cross-cutting edge, and we replace $e_{uv}$ with $e_{uw}$, then $e_{uw}$ has a positive probability of becoming one more cross-cutting edge in this local view of $u$, $v$ and $w$, which results in a higher conductance. ■



Proof: Let $n = |N(u) \cap N(v)|$. We only need to construct a counterexample for each case that satisfies (15) but in which $e_{uv}$ is a cross-cutting edge. Assume we have a graph like Fig. 12, which shows the whole view of it. We let the number of common neighbors of nodes $u$ and $v$ be $n$. Assuming $k_u > k_v$, from (15) we get:

\[ |O_u| = \max\{k_u, k_v\} - n - 1 \ge 1 \ \text{if } n \text{ is even}, \qquad |O_u| = \max\{k_u, k_v\} - n - 1 \ge 2 \ \text{if } n \text{ is odd}. \tag{16} \]

Here $O_u = \{e_{wu} \mid w \in N(u) - (N(v) \cup \{v\})\}$, which denotes the outer edges of $u$ that are linked neither to the node $v$ nor to their common neighbors. We can carefully construct a graph like Fig. 12 in which each neighbor of nodes $u$ and $v$ only has 1-degree neighbors. So we need to prove that after assigning the degree for each node, $e_{uv}$ will be a cross-cutting edge. If we simply let

\[ k_w \gg \max\{k_u, k_v\} \quad \forall w \in V - \{u, v\}, \tag{17} \]



Corollary 2. For $v \in V$, if $k_v \neq 3$, then there always exists a graph $G(V,E)$, $\forall u, w \in N(v)$, such that replacing $e_{uv}$ with $e_{uw}$ will decrease the conductance or have no effect.

Proof: If $k_v = 1$, then we could not cut it to disconnect the graph. If $k_v = 2$, we need to check some possible situations. If none of the edges linked to $v$ is a cross-cutting edge, then the replacement would have no effect on the conductance. If either $e_{uv}$ or $e_{wv}$ is a cross-cutting edge, then replacing one of them with $e_{uw}$ will not generate another cross-cutting edge, because now $k_v = 1$, and $v$ should belong to one side of the separation, $S$ or $\bar{S}$.

So we only need to consider the situation when $k_v \ge 4$. See Fig. 13. There exists a case in which both $e_{uv}$ and $e_{wv}$ are cross-cutting edges. Then replacing $e_{uv}$ with $e_{uw}$ would decrease the number of cross-cutting edges from 2 to 1 locally, which may lead to a dramatic decrease of the conductance of the graph.




following inequality would hold: 



1 > - (max{fe u , 

2|JV*|- J] 
wen* 

1 > r (max{fc„, fc„}) 



2) 



- [2|JV*|- X! (*»-2)J 

\ wEN' / 



Fig. 14. N* is the set we have accessed before and whose nodes are of 
degree 2 and 3. We do not know the blue nodes' degree. 



The uniqueness of $k_v = 3$ is that there would not exist the case when both $e_{uv}$ and $e_{wv}$ are cross-cutting edges. ■

Theorem 5. Given $G(V, E)$, $\forall u, v \in V$, if $e_{uv} \in E$ and

\[ \left\lceil \frac{|N(u) \cap N(v)| - |N^*|}{2} \right\rceil + 1 + \frac{1}{2}\sum_{w \in N^*} (4 - k_w) > \frac{1}{2}\max\{k_u, k_v\}, \tag{20} \]

we can assert that $e_{uv}$ is not a cross-cutting edge. Here we denote $N^* = \{w \in N(u) \cap N(v) \mid k_w \text{ is known}, 2 \le k_w \le 3\}$.

Proof: Note that if we do not know any degree information about the common neighbors of $u$ and $v$, then $N^* = \emptyset$, and Theorem 5 is exactly the same as Theorem 3. We are going to prove this theorem by contradiction: if we assume $e_{uv}$ is a cross-cutting edge, then we can find another configuration of $S$ and $\bar{S}$ in which $e_{uv}$ is not a cross-cutting edge but which obtains a lower conductance. Again, let $n = |N(u) \cap N(v)|$ be the number of common neighbors of $u$ and $v$; according to the assumption, there must be $n+1$ cross-cutting edges in this local view of the graph, see Fig. 14.



Given a node $w \in N(u) \cap N(v)$, according to some historical information we can obtain its degree $k_w$ without paying any query cost. If $k_w \ge 4$, then it makes no sense to consider rearranging it, because dragging $w$ from $S$ to $\bar{S}$ would probably increase the number of cross-cutting edges, given that we do not know the edge information outside this local view of the graph. Therefore, we only need to consider $N^*$, which is the set of all common neighbors of degree 2 or 3.

if we denote that the number of cross-cutting edges linked 
to u within N*\j{u\ is , the number of cross-cutting edges 
linked to u outside N*\j{u\ is n"u \ and similarly we have n« 



and ■ So we have n!$ 



0) , (*) 

1 V / 1^ iy, \ J 



- riv = n. According 



Without loss of generality, assume the first one holds; we are then going to prove that by rearranging the set $N^* \cup \{u\}$ we can achieve a lower conductance, which leads to the contradiction.

Imagine that we try to drag the whole set $N^* \cup \{u\}$ from $S$ to $\bar{S}$; then we need to "rearrange" all the edges linked to the set: those cross-cutting edges linked to the set but outside $N^*$ will be "flipped", i.e., from cross-cutting edges to non-cross-cutting edges and vice versa; those cross-cutting edges linked to the set but inside $N^*$ will be eliminated, since otherwise there would be two cross-cutting edges linked to a node in $N^*$, which is impossible because $\forall w \in N^*$, $2 \le k_w \le 3$.

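Let $O_{N^* \cup \{u\}} = \{e_{vw} \in E \mid v \in (N^* \cup \{u\}), w \notin (N^* \cup \{u\})\}$, then

\[ |O_{N^* \cup \{u\}}| = \max\{k_u, k_v\} + \sum_{w \in N^*} k_w - 2|N^*|. \tag{21} \]

And we know that the minimum number of cross-cutting edges we can manipulate will be at least $\lceil (n - |N^*|)/2 \rceil + 1 + |N^*|$. So, as the result of a one-line calculation from (20),

\[ \left\lceil \frac{n - |N^*|}{2} \right\rceil + 1 + |N^*| > \frac{1}{2}\,|O_{N^* \cup \{u\}}|. \tag{22} \]

Therefore moving the set $N^* \cup \{u\}$ from $S$ to $\bar{S}$ will always result in a lower conductance. ■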

Theorem 6. Given a latent space graph model G(V,E), 
assume a = +oo, then the expected number of edges we can 
removed 

E[R] > \E\ ■ P (d < V(r) - 1 J\ (23) 

Moreover, if we assume the dimension D = 2, and nodes are 
uniformly distributed in a rectangle [0, a] x [0,6], then for the 
graph G* (after removing edges from G) is: 

$(G) 



E[$(G*)] > 



(24) 



to the condition described in Proposition VIII either of the 



J+z|<0.75r 2 fa( z l)fb(z2)dz 1 Z 2 

where Z\ and z 2 are independent uniform random variable 
supported on [0, a] and [0, 6]. 



Proof: 



According to [21], we have 

V(r) fl-^V< 



\N(i)nN(j)\ 



\N(i)UN(j)\ 



(25) 



V(r) is the volume of a D dimensional hypers phere of radius 
r. Therefore, if we have small enough dij, than we can confirm 
that we can remove the edge ey . Conservatively, from theorem 



[3] we can reasonably assert that if \N(i) fl N(j)\ > \N(i) U smaller than the threshold is 
N(j) \ — 2, then the edge ey can be safely removed. So when 



dij < 2r 1 - 



V(r 



|iV(i)uiV(j)| 



the edge e^j can be removed. Now, we have transformed the 
probability of removing an edge to the probability of two 
node's distance is within a threshold. Since \N(i)UN(j)\ > 3, 
so ( |23] i holds. 

Given more assumptions of dimension and the distribution 
of nodes, the probability of two nodes' euclidean distance 



nd<d Q )= // fa{zi)h(z 2 )dz lZ2 . (27) 

J Jzl+zl<d 2 

Also, since \N(u) fl N(v)\ > 3, the change of conductance 
(26) can be calculated as 

W(S)\ 



E[$(G*)] = 



a(S) - P(d < do)a(S) 
' <f(G) 



> 



1-F(d< d ) 
<HG) 

1 - //*»+**<0.75ra fa{zi)fb{z2)dz l Z 2 ' 



(28) 
(29) 
(30) 



